Polyphonic Sound Event Detection with Weak Labeling
Author
Abstract
Sound event detection (SED) is the task of detecting the type and the onset and offset times of sound events in audio streams. It is useful for purposes such as multimedia retrieval and surveillance. Sound event detection is more difficult than speech recognition in several respects: first, sound events are much more variable than phonemes, notably in terms of duration but also in terms of spectral characteristics; second, sound events often overlap with each other, which phonemes do not.

To train a system for sound event detection, it is conventionally necessary to know the type, onset time and offset time of each occurrence of a sound event. We call this type of annotation strong labeling. However, such annotation is not available in amounts large enough to support deep learning, for multiple reasons: first, it is tedious to manually label each sound event with exact timing information; second, the onsets and offsets of long-lasting sound events (e.g. a car passing by) and repeating sound events (e.g. footsteps) may not be well defined. In reality, annotation of sound events often comes without exact timing information. We call such annotation weak labeling. Even though it carries less information than strong labeling, weak labeling may come in larger amounts and is well worth exploiting.

In this thesis, we propose to train deep learning models for SED using various levels of weak labeling. We start with sequential labeling: we know the sequence of sound events occurring in each training recording, but not their onset and offset times. We show that the sound events can be learned and localized by a recurrent neural network (RNN) with a connectionist temporal classification (CTC) output layer, which is well suited to sequential supervision. Then we relax the supervision to presence/absence labeling: we only know whether each sound event is present or absent in each training recording. We solve SED with presence/absence labeling in the multiple instance learning (MIL) framework, and analyze the network's behavior on transient, continuous and intermittent sound events. As we explore learning to detect sound events with weak labeling, we are often faced with data scarcity. To overcome this difficulty, we resort to transfer learning: we train neural networks for out-of-domain tasks on large data, and use the trained networks to extract features for SED. We make a special effort to preserve the temporal resolution of these transfer learning feature extractors.
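To make the sequential-labeling setup concrete, the sketch below shows how an RNN with a CTC output layer can be trained when only the ordered sequence of event labels is available for each recording, without onset and offset times. The architecture, label inventory and tensor shapes are illustrative assumptions for this sketch, not the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn

# Hypothetical label inventory: index 0 is reserved for the CTC blank symbol.
NUM_EVENTS = 10             # assumed number of sound event classes
NUM_CLASSES = NUM_EVENTS + 1

class CTCEventTagger(nn.Module):
    """Bidirectional GRU that maps log-mel frames to per-frame label posteriors."""
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, NUM_CLASSES)

    def forward(self, x):                       # x: (batch, frames, n_mels)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)  # (batch, frames, classes)

model = CTCEventTagger()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Toy batch: 4 recordings of 500 frames each, annotated only with event
# *sequences* (no onset/offset times), padded to a common length of 6 labels.
feats = torch.randn(4, 500, 64)
targets = torch.randint(1, NUM_CLASSES, (4, 6))
input_lengths = torch.full((4,), 500, dtype=torch.long)
target_lengths = torch.tensor([6, 4, 5, 6])

log_probs = model(feats).permute(1, 0, 2)       # CTCLoss expects (frames, batch, classes)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

At test time, the frame-level posteriors can be decoded greedily or with a beam search to recover the event sequence, and the positions of their peaks can give a rough localization of each event.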
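For the presence/absence setting, a standard multiple instance learning formulation treats each recording as a bag of frames: the network predicts frame-level event probabilities, and a pooling function aggregates them into a recording-level probability that is trained against the weak label. The sketch below uses linear-softmax pooling as one possible choice; the layer sizes, pooling function and number of event classes are assumptions for illustration rather than the thesis's exact model.

```python
import torch
import torch.nn as nn

class MILEventDetector(nn.Module):
    """Recurrent detector trained with only clip-level presence/absence labels."""
    def __init__(self, n_mels=64, n_events=10, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.frame_prob = nn.Sequential(nn.Linear(2 * hidden, n_events), nn.Sigmoid())

    def forward(self, x):                         # x: (batch, frames, n_mels)
        h, _ = self.rnn(x)
        p = self.frame_prob(h)                    # frame-level probabilities
        # Linear-softmax pooling: frames with higher probability receive higher
        # weight, so the clip-level score is dominated by the confident frames.
        clip = (p * p).sum(dim=1) / p.sum(dim=1).clamp(min=1e-7)
        return clip, p                            # clip: (batch, n_events)

model = MILEventDetector()
criterion = nn.BCELoss()

feats = torch.randn(8, 500, 64)                      # 8 recordings, 500 frames, 64 mel bands
weak_labels = torch.randint(0, 2, (8, 10)).float()   # presence/absence per event class

clip_prob, frame_prob = model(feats)
loss = criterion(clip_prob, weak_labels)
loss.backward()
# frame_prob can be thresholded at test time to obtain onsets and offsets.
```

Other pooling functions (max, average, attention-based) fit the same template; the choice of pooling function affects how well the frame-level probabilities localize transient versus continuous and intermittent events.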
Similar works
Giambattista Parascandolo: Recurrent Neural Networks for Polyphonic Sound Event Detection
TAMPERE UNIVERSITY OF TECHNOLOGY, Master's Degree Programme in Signal Processing. PARASCANDOLO, GIAMBATTISTA: Recurrent Neural Networks for Polyphonic Sound Event Detection. Master of Science Thesis, 66 pages, November 2015. Major: Signal Processing. Minor: Learning and Intelligent Systems. Examiners: Tuomas Virtanen, Heikki Huttunen.
Metrics for Polyphonic Sound Event Detection
This paper presents and discusses various metrics proposed for the evaluation of polyphonic sound event detection systems used in realistic situations, where multiple sound sources are typically active simultaneously. The system output in this case contains overlapping events, with multiple sounds detected as active at the same time. The polyphonic system output requires a suitable ...
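Although the snippet above is truncated, the segment-based metrics commonly used for polyphonic SED can be stated concretely: the reference and system outputs are rasterized into fixed-length segments, per-segment true positives, false positives and false negatives are accumulated into an F-score, and an error rate is computed in which a false positive and a false negative in the same segment count as a single substitution. The sketch below illustrates these formulas, assuming both outputs are already binary segment-by-class activity matrices; the sed_eval toolkit provides a complete reference implementation.

```python
import numpy as np

def segment_based_metrics(reference, estimated):
    """Segment-based F-score and error rate for polyphonic SED.

    Both inputs are binary arrays of shape (n_segments, n_classes), where
    entry (t, c) is 1 if class c is active in segment t.
    """
    tp = np.logical_and(reference == 1, estimated == 1).sum()
    fp = np.logical_and(reference == 0, estimated == 1).sum()
    fn = np.logical_and(reference == 1, estimated == 0).sum()

    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f_score = 2 * precision * recall / max(precision + recall, 1e-12)

    # Error rate: within each segment, pair up FNs and FPs as substitutions first.
    fn_t = np.logical_and(reference == 1, estimated == 0).sum(axis=1)
    fp_t = np.logical_and(reference == 0, estimated == 1).sum(axis=1)
    substitutions = np.minimum(fn_t, fp_t).sum()
    deletions = np.maximum(0, fn_t - fp_t).sum()
    insertions = np.maximum(0, fp_t - fn_t).sum()
    error_rate = (substitutions + deletions + insertions) / max(reference.sum(), 1)

    return f_score, error_rate

# Toy example: 5 one-second segments, 3 event classes.
ref = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]])
est = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 1, 0]])
print(segment_based_metrics(ref, est))
```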
Bidirectional LSTM-HMM Hybrid System for Polyphonic Sound Event Detection
In this study, we propose a new method of polyphonic sound event detection based on a Bidirectional Long Short-Term Memory Hidden Markov Model hybrid system (BLSTM-HMM). We extend the hybrid model of neural network and HMM, which achieved state-of-the-art performance in the field of speech recognition, to the multi-label classification problem. This extension provides an explicit duration model ...
Context-dependent sound event detection
The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out unlikely events given the context. We propose a similar utilization of context information in...
A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification
Sound event detection is the task of detecting the type, onset time, and offset time of sound events in audio streams. The mainstream solution is recurrent neural networks (RNNs), which usually predict the probability of each sound event at every time step. Connectionist temporal classification (CTC) has been applied in order to relax the need for exact annotations of onset and offset times; th...